In most of the cases you will need to change the format of your
dataset because it will not be in the right format that you will need to
plot or analyse the data. tidyverse has a package called
dplyr which contains functions that help you to select
columns/rows, sort, combine and perform other data types of data
transformations. In the next sections we will look at different ways to
transform our dataset. Now that we already know the basics of
visualising data with ggplot we will also learn how to
visualise other plots with the transformed dataset as we go along.
To learn more about dplyr please look at the following
resources:
If you find these resources difficult to understand, return to these after completing the course.
Before we move onto the different ways that we can transform our data, we’ll discuss one of the most powerful additions to R, pipes.
Pipes let you take the output of one function and send it directly to
the next, which is useful when you need to do many things to the same
dataset. Pipes in R look like %>% and are made available
via the magrittr package, installed
automatically with dplyr.
Let’s say we are interested only in the data that do not contain any missing values. Using pipes, we do the following:
surveys %>%
drop_na()This takes the argument on the left (surveys) and passes
it on to the function after it (drop_na, which removes all
rows with missing values).
If we want to create a new object with the transformed data we can assign it a new name as below:
surveys_complete <- surveys %>%
drop_na()
surveys_completeAlthough at this point pipes might not seem like a revolutionary invention, they become more powerful when combining multiple operations. This we’ll see next.
To select columns of a data frame or tibble, use
theselect function. The first argument is the data frame or
tibble you are working on (in our example it is surveys,
which we pipe through), and the subsequent arguments are the columns to
keep.
# Extract species_id, weight, hindfoot_lenth, year and sex columns from surveys dataset.
surveys %>%
select(species_id, weight, hindfoot_length, year, sex)To select all columns except certain ones, put a
- in front of the column to exclude it.
# Select all columns of the surveys dataset apart from record_id and species_id columns.
surveys %>%
select(-record_id, -species_id)To remove rows from a data frame or tibble use the
filter function from the dplyr package. The
first argument is the data frame or tibble to perform the filtering on
(in this case we pipe the data through) and the next arguments are the
conditions on which to keep the rows.
To choose rows based on a specific condition, use the
filter function as follows:
# Keep only the observations of animals collected from 1995 onwards from the surveys dataset.
surveys %>%
filter(year >= 1995)You can filter on multiple conditions:
# Keep only the observations of animals collected from 1995 onwards
# that are female from the surveys dataset.
surveys %>%
filter(year >=1995,
sex == "F")To quote or not to quote?
The sex column is a character and thus needs to be
quoted, whereas the year column is numerical and does not.
Also note that the filter arguments could have been written on a single
line, but it is useful to break up your code sometimes to make it more
readable.
Subset the surveys_complete data to keep only the
species_id, weight,
hindfoot_length, year and sex
columns and the animals collected on and after 1995. Then plot a scatter
plot of weight (x-axis) against
hindfoot_length (y-axis) using this transformed dataset. Do
all the above using pipes, without creating any variables.
surveys_complete %>%
# Select columns
select(species_id, weight, hindfoot_length, year, sex) %>%
# Filter rows
filter(year >= 1995) %>%
# Plot transformed data
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
Plot all the animals in the surveys_complete dataset as
weight (x-axis) against hindfoot_length
(y-axis). Use the dataset created above which contains only the animals
that were collected on and after 1995 and highlight these points in red
in the plot. Hint: create a new object called
survey_recent that contains the dataset created above.
survey_recent <- surveys_complete %>%
select(species_id, weight, hindfoot_length, year, sex) %>%
filter(year >= 1995)
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(data = surveys_complete) +
geom_point(data = survey_recent, colour = "red", alpha = 0.4)
Note: In the example above we did not specify the
data argument in the ggplot function because
the arguments in the ggplot function should be common
throughout the whole plot. In this case we specified the respective data
as separate layers in 2 different geom_point plots; the
first plot is the data with all the animals in
surveys_complete (points in black), the second plot is
layered on top of the first plot and is a subset of the
surveys_complete dataset (points in red).
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use the
mutate function.
To create a new column of weight in kg:
surveys_complete %>%
mutate(weight_kg = weight / 1000)You can also create a second new column based on the first new column
within the same call of mutate():
surveys_complete %>%
mutate(weight_kg = weight / 1000,
weight_lbs = weight_kg * 2.20462)There are other ways on how to create new columns. Refer to the dplyr cheat sheet Make New Variables section.
Using the surveys_complete dataset, create a boxplot
for each year on the x-axis and weight in kg on the y-axis. Use the
subset dataset that was transformed in the previous challenge,
i.e. the one that has the recent observations (from 1995
onwards). As before try to do all the operations using pipes, without
creating variables.
Hint: You will need to use the as_factor
function to convert the year column to factor
type.
Your first attempt would be to use the code we used before to subset
the surveys_complete dataset, but also adding the
mutate function to add a new weight_kg column.
Further more you will need to also add the ggplot function
to draw a boxplot:
surveys_complete %>%
# Add weight_kg column
mutate(weight_kg = weight / 1000) %>%
# Select columns
select(species_id, weight_kg, hindfoot_length, year, sex) %>%
# Filter rows
filter(year >= 1995) %>%
# Plot transformed data
ggplot(mapping = aes(x = year, y = weight_kg)) +
geom_boxplot()#> Warning: Continuous x aesthetic -- did you forget aes(group=...)?

You might notice that this code produces a warning message about the
continuous data type of the x-axis and also creates a box plot with one
single box and whisker averaged over all the years on the x-axis, rather
than one for each year in the dataset as requested. As shown in the ggplot2
cheat sheet, box plots display a discrete variable on the x-axis and
a continuous variable on the y-axis. At the moment year is
treated as a continuous variable since its data type is
double (you can see this from the tibble). To fix this, we
need to convert the data type of the year column to a
categorical one (factor) so that each year can be plotted
as a separate box and whisker.
The forcats package that is part of
tidyverse has the as_factor
function which does just that; it converts a variable into a factor. See
the updated code below, which uses the as_factor
function:
surveys_complete %>%
# Add weight_kg column
mutate(weight_kg = weight / 1000) %>%
# Select columns
select(species_id, weight_kg, hindfoot_length, year, sex) %>%
# Filter rows
filter(year >= 1995) %>%
# Convert the year column to a factor
mutate(year = as_factor(year)) %>%
# Plot transformed data
ggplot(mapping = aes(x = year, y = weight_kg)) +
geom_boxplot()
Notice that since we need to keep only observations that have been
observed from 1995 onwards, we need to convert the year
column to a factor after we filter the rows, as
otherwise the row filtering operation will not be possible. This is
because the new year column of type factor
does not have any ordering in its categories so perfoming
year >= 1995 would not be possible.
It is also possible to convert the year variable
directly from within ggplot, thereby skipping the
mutate step from above. We can do this using the updated
code below (output is the same as above):
surveys_complete %>%
# Add weight_kg column
mutate(weight_kg = weight / 1000) %>%
# Select columns
select(species_id, weight_kg, hindfoot_length, year, sex) %>%
# Filter rows
filter(year >= 1995) %>%
# Convert the year variable and plot transformed data
ggplot(mapping = aes(x = as_factor(year), y = weight_kg)) +
geom_boxplot()
Note on the boxplot
As mentioned the boxplot is used to visualise the distribution of a
continuous variable. In the example above we displayed the distribution
of weight over the different years. The figure below
illustrates the different components of the boxplot. The bottom line of
the box represents the 25th quantile, middle line the 50th quantile (or
median) and the top line of the box is the 75th quantile. The whiskers
can be thought of as defining the boundary of the confidence intervals
of the distribution. The dots outside these whiskers are the
outliers.
To sort your data dplyr provides function
arrange.
# Sort weight in ascending order (default)
surveys_complete %>%
arrange(weight)To sort your data in descending order you will need to use
desc().
# Sort weight in descending order
surveys_complete %>%
arrange(desc(weight))You can sort your dataset based on the values of multiple columns:
# Sort weight in ascending order and hindfoot_length in descending order
surveys_complete %>%
arrange(weight, desc(hindfoot_length))As you can see from the result returned, the animals with the
smallest weight are at the top. When there is a tie, i.e., more
than one animal has the same weight, the animals are sorted in
descending order of hindfoot_length. As you can see, the
subset of animals with weight of 4 have been sorted in
descending order based on hindfoot_length.
Creating summaries of your data would be a good way to start describing the variable you are working with. Summary statistics are a good example of how one can summarise data. We will not cover details about summary statistics in this course, but we will look at how we can summarise data in R.
countThe most straightforward summary data is counting the the number of
observations or frequencies. Frequencies are normally calculated when
working with discrete variables that have a finite number of values,
such as categorical data. In our surveys_complete dataset,
let us obtain the frequecies of male and female animals present. We can
do this by counting the number of “M” and “F” present in the dataset. To
do this use the dplyr function count as
follows:
surveys_complete %>%
count(sex)#> # A tibble: 2 × 2
#> sex n
#> <chr> <int>
#> 1 F 14520
#> 2 M 16031
As you can see count has grouped the categories present
in the sex column and returned the frequency of each
category. If we wanted to count combination of factors, such as
sex and species, we would specify the first
and the second factor as the arguments of count():
surveys_complete %>%
count(sex, species) #> # A tibble: 41 × 3
#> sex species n
#> <chr> <chr> <int>
#> 1 F albigula 606
#> 2 F baileyi 1593
#> 3 F eremicus 539
#> 4 F flavus 711
#> 5 F fulvescens 55
#> 6 F fulviventer 15
#> 7 F hispidus 91
#> 8 F leucogaster 436
#> 9 F leucopus 16
#> 10 F maniculatus 354
#> # … with 31 more rows
plot_type
surveyed?surveys_complete %>%
count(plot_type) surveys_complete %>%
count(sex, species) %>%
arrange(species, desc(n))If you would like to plot the distribution of a single
continuous variable the frequency will be automatically
calculated, so you do not need to use count() to calculate
the frequency beforehand. The x-axis is automatically divided into bins
and the number of observations of the continuous variable in each bin is
shown as a bar in the histogram. In the ggplot2 package a
histogram can be plotted using the geom_histogram
function.
Let us plot a histogram for the continuous variable
weight:
ggplot(surveys_complete, aes(weight)) +
geom_histogram()
You can identify display categories in the histogram by plotting a
stacked histogram which will show categories for each group stacked on
top of each other. This is done by using the fill argument
in the aesthetic function. If we want to display sex in our
weight histogram:
ggplot(surveys_complete, aes(weight, fill = sex)) +
geom_histogram(bins = 100)
Note that the default number of bins in a histogram is 30. To get
have a granular display you can increase the number of bins by using the
argument bins in the geom_histogram function
as above.
There are other plots that can be used for a single continuous variable (see ONE VARIABLE continuous section on ggplot2 cheat sheet).
summarise()When working with continuous variables, one of the most popular
summary statistic is the mean. If we try to caclulate the mean on
weight in the surveys_complete dataset we get
an error:
surveys_complete %>%
mean_weight = mean(weight)#> Error in mean(weight): object 'weight' not found
This is because in dplyr you will need to use the
summarise function to be able to create summaries of your
data.
The summarise function is used when you want to reduce
multiple values in your columns to a single value. If we try to
calculate the mean using summarise() this time:
surveys_complete %>%
summarise(mean_weight = mean(weight))#> # A tibble: 1 × 1
#> mean_weight
#> <dbl>
#> 1 41.8
As seen from the result, the mean of the values present
in the weight column is 41.8. This value is the mean of
weight of the entire data set. We already know that there
are many different groups within the data set, so this is not very
informative. It would therefore be good to calculate mean
values for groups within the data. We cover how to do this next.
In the examples above we learned how to summarise data over all
observations, e.g., we calculated the mean over all
observations using the summarise function. However, in data
analysis, especially when dealing with big data, a common approach to
data exploration is the split-apply-combine
strategy. The idea behind this strategy is to split the data into more
manageable pieces, apply any operations required on the data
independently on each piece and then combine the results together. The
figure below illustrates the approach that is done in the
split-apply-combine approach.
Let’s work on an example on how we can apply the split-apply-combine
strategy on the surveys_complete dataset. We would like to
split the data by the different categories present in the
sex column and calculate the mean weight for each category.
We can do this easily by using the group_by function in the
dplyr package:
surveys_complete %>%
group_by(sex) %>%
summarise(mean_weight = mean(weight))#> # A tibble: 2 × 2
#> sex mean_weight
#> <chr> <dbl>
#> 1 F 41.5
#> 2 M 42.1
You can also group by multiple columns:
surveys_complete %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight))#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> # A tibble: 46 × 3
#> # Groups: sex [2]
#> sex species_id mean_weight
#> <chr> <chr> <dbl>
#> 1 F BA 9.16
#> 2 F DM 41.6
#> 3 F DO 48.5
#> 4 F DS 117.
#> 5 F NL 154.
#> 6 F OL 30.8
#> 7 F OT 24.8
#> 8 F OX 22
#> 9 F PB 30.2
#> 10 F PE 22.8
#> # … with 36 more rows
Once the data are grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys_complete %>%
group_by(sex, species_id) %>%
summarise(mean_weight = mean(weight),
min_weight = min(weight))#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> # A tibble: 46 × 4
#> # Groups: sex [2]
#> sex species_id mean_weight min_weight
#> <chr> <chr> <dbl> <dbl>
#> 1 F BA 9.16 6
#> 2 F DM 41.6 10
#> 3 F DO 48.5 12
#> 4 F DS 117. 45
#> 5 F NL 154. 32
#> 6 F OL 30.8 10
#> 7 F OT 24.8 5
#> 8 F OX 22 22
#> 9 F PB 30.2 12
#> 10 F PE 22.8 11
#> # … with 36 more rows
Use group_by() and summarise() to find the
mean, min, and max hindfoot length for each species (using
species_id). Also add the number of observations in a
column called n_obs (hint: see ?n).
surveys_complete %>%
group_by(species_id) %>%
summarise(
mean_hindfoot_length = mean(hindfoot_length),
min_hindfoot_length = min(hindfoot_length),
max_hindfoot_length = max(hindfoot_length),
n_obs = n()
)#> # A tibble: 24 × 5
#> species_id mean_hindfoot_length min_hindfoot_length max_hindfoot_length n_obs
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 BA 13 6 16 45
#> 2 DM 36.0 16 50 9702
#> 3 DO 35.6 26 64 2781
#> 4 DS 50.0 39 58 2023
#> 5 NL 32.2 21 42 1045
#> 6 OL 20.5 12 39 905
#> 7 OT 20.3 13 50 2062
#> 8 OX 20.4 19 21 5
#> 9 PB 26.1 2 47 2762
#> 10 PE 20.2 11 30 1197
#> # … with 14 more rows
What was the heaviest animal measured in each year? Return the
columns year, genus, species, and
weight.
surveys_complete %>%
group_by(year) %>%
filter(weight == max(weight)) %>%
select(year, genus, species_id, weight) %>%
arrange(year)#> # A tibble: 28 × 4
#> # Groups: year [26]
#> year genus species_id weight
#> <dbl> <chr> <chr> <dbl>
#> 1 1977 Dipodomys DS 149
#> 2 1978 Neotoma NL 232
#> 3 1978 Neotoma NL 232
#> 4 1979 Neotoma NL 274
#> 5 1980 Neotoma NL 241
#> 6 1980 Neotoma NL 241
#> 7 1981 Neotoma NL 251
#> 8 1982 Neotoma NL 252
#> 9 1983 Neotoma NL 256
#> 10 1984 Neotoma NL 259
#> # … with 18 more rows
Now that you have learned how to use
dplyr to transform your raw data, you may
want to export these new datasets to share them with your collaborators
or for archival.
Similar to the read_csv function used for reading CSV
files into R, there is a write_csv
function that generates CSV files from data frames and tibbles which
is also present in the readr package.
Before using write_csv(), note that we are going to
write the data to the data_processed folder. This is to
keep these data separate from our raw data in data_raw.
Let’s save the surveys_complete tibble in
data_processed/surveys_complete.csv file:
write_csv(surveys_complete, "data_processed/surveys_complete.csv")The topics we have covered so far are just a small fraction of what is possible in R. Below is information on some more advanced topics of data wrangling and plotting. The topics and challenges are hopefully self-explanatory!
The ggplot2 package has a way of creating different
plots based on the different categories in the data. This is known as
facetting. With facetting we do not need to use
group_by() to split the data into different groups to be
able to plot the different categories in different plots as
ggplot2 does this automatically.
There are two types of facet functions:
facet_wrap() arranges the different plots into muliple
rows and columns to cleanly fit on one page.facet_grid() plots all the categories in 1 row or 1
column.Let us see this in action.
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
As you can see, each genus has been plotted as a separate plot. It is
now clear which are the genera that were observed the most. Another
advantage of facetting is that it uses a common axes and all plots are
aligned to the same values on the axes, making the different plots
comparable. If you want to have different axes for each plot you can do
so by using the scales argument.
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus), scales = "free")
The pattern of the graphs that before were hardly visible, e.g., Baiomys, is now clear as the axes have been rescaled to fit the data. This is the main advantage of using free scales. The disadvantage is that the different plots are not comparable as before.
If we would like to see if there is any difference between the
sex, we can do this by adding sex as another
grouping to count(). Let’s assign it to an object
yearly_sex_counts.
yearly_sex_counts <- surveys_complete %>%
count(year, genus, sex)yearly_sex_counts will now look like:
#> # A tibble: 389 × 4
#> year genus sex n
#> <dbl> <chr> <chr> <int>
#> 1 1977 Chaetodipus F 3
#> 2 1977 Dipodomys F 103
#> 3 1977 Dipodomys M 119
#> 4 1977 Onychomys F 2
#> 5 1977 Onychomys M 1
#> 6 1977 Perognathus F 14
#> 7 1977 Perognathus M 8
#> 8 1977 Peromyscus M 2
#> 9 1977 Reithrodontomys F 1
#> 10 1977 Reithrodontomys M 1
#> # … with 379 more rows
This should now allow us to also split by sex. We can
use colour to distinguish between the sex categories:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
Let us do the same thing with facet_grid() so that we
can understand the difference between the two facetting techniques in
the ggplot2 package. With facet_grid() you
specify what variable you would like to split on as in the
rows or cols arguments:
surveys_complete %>%
group_by(year, genus) %>%
count() %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
# Display the genera as columns
facet_grid(cols = vars(genus))
As you can see facet_grid() placed all the categories of
genus in 1 row, unlike facet_wrap() which have
spread them over multiple rows to fit well in 1 page. Let us split the
plots by sex as well by plotting sex as the
rows:
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
More information on further functionality of facetting can be found in the
facet_wrap()andfacet_grid()documentation.
Challenge
Instead of splitting the plots based on
sexdisplay thesexas different coloured line graphs in the same plot.ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) + geom_line() + facet_grid(cols = vars(genus))
In the data exploration workflow we discussed how to structure our data leading to the four rules defining a tidy dataset:
Here we examine the fourth rule: Each type of observational unit forms a table.
In surveys , the rows of surveys contain
the values of variables associated with each record (the unit), values
such as the weight or sex of each animal associated with each record.
What if instead of comparing records, we wanted to compare the different
mean weight of each genus between plots? (Ignoring
plot_type for simplicity).
We’d need to create a new table where each row (the unit) is
comprised of values of variables associated with each plot. In practical
terms this means the values in genus would become the names
of column variables and the cells would contain the values of the mean
weight observed on each plot.
Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.
The opposite transformation would be to transform column names into values of a variable.
We can do both these of transformations with two tidyr
functions, pivot_wider() and
pivot_longer().
pivot_wider()pivot_wider() takes three principal arguments:
datanames_from column variable whose values will become
new column names.values_from column variable whose values will fill
the new column variables.Further arguments include values_fill which, if set,
fills in missing values with the value provided.
Let’s use pivot_wider() to transform surveys to find the
mean weight of each genus in each plot over the entire survey period. We
use filter(), group_by() and
summarise() to filter our observations and variables of
interest, and create a new variable for the mean_weight. We
use the pipe as before too.
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarise(mean_weight = mean(weight))#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.
str(surveys_gw)This yields surveys_gw where the observations for each
plot are spread across multiple rows, 196 observations of 3 variables.
Using pivot_wider() to key on genus with
values from mean_weight this becomes 24 observations of 11
variables, one row for each plot. We again use pipes:
surveys_wider <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
str(surveys_wider)
We could now plot comparisons between the weight of genera in
different plots, although we may wish to fill in the missing values
first, using the values_fill argument.
Note: be careful when replacing NA with
zero, since a mean_weight of would be recognised as an
actual measurement, and not as a missing value!
surveys_gw %>%
pivot_wider(
names_from = genus,
values_from = mean_weight,
values_fill = list(mean_weight = 0)
)pivot_longerThe opposing situation could occur if we had been provided with data
in the form of surveys_wider, where the genus names are
column names, but we wish to treat them as values of a genus variable
instead.
In this situation we are lenghtening the data by taking the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer() takes four principal arguments:
datacols containing the names of the columns we want to
pivot into a longer formatnames_to containing a string specifying the name of
the column to create from the data stored in the column names of
datavalues_tocontaining a string specifying the name of
the column to create from the data stored in cell values.To recreate surveys_gw from surveys_wider
we would create a column using names_to called
genus. We would also need to create a column called
mean_weight using the values_to argument.
Lastly, we would use all columns apart from plot_id to
pivot into a longer format. Here we drop plot_id column
with a minus sign.
surveys_longer <- surveys_wider %>%
pivot_longer(
cols = -plot_id,
names_to = "genus",
values_to = "mean_weight")
str(surveys_longer)
Note that now the NA genera are included in the longer
format. Making data wider and then longer can be a useful way to balance
out a dataset so every replicate has the same composition.
We could also have used a specification for what columns to include.
This can be useful if you have a large number of identifying columns,
and it’s easier to specify what to gather than what to leave alone. And
if the columns are in a row, we don’t even need to list them all out -
just use the : operator!
surveys_wider %>%
pivot_longer(
cols = Baiomys:Spermophilus,
names_to = "genus",
values_to = "mean_weight") %>%
head()Challenge
- Take the
surveysdata frame and make it wider, usingyearas columns,plot_idas rows, and the number of genera per plot as the values. You will need to summarise before reshaping, and use the functionn_distinct()to get the number of unique genera within a particular chunk of data. It’s a powerful function! See?n_distinctfor more.surveys_wider_genera <- surveys %>% group_by(plot_id, year) %>% summarise(n_genera = n_distinct(genus)) %>% pivot_wider(names_from = year, values_from = n_genera)#> `summarise()` has grouped output by 'plot_id'. You can override using the #> `.groups` argument.head(surveys_wider_genera)
- Now take that data frame and
pivot_longer()it again, so each row is a uniqueplot_idbyyearcombination.surveys_wider_genera %>% pivot_longer( cols = -plot_id, names_to = "year", values_to = "n_genera")
- The
surveysdata set has two measurement columns:hindfoot_lengthandweight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, usepivot_longer()to create a dataset where we have a key column calledmeasurementand avaluecolumn that takes on the value of eitherhindfoot_lengthorweight. Hint: You’ll need to specify which columns are being pivoted.surveys_long <- surveys %>% pivot_longer( cols = c(hindfoot_length, weight), names_to = "measurement", values_to = "value")
- With this new data set, calculate the average of each
measurementin eachyearfor each differentplot_type. Then make the data set wider usingpivot_wider()with a column forhindfoot_lengthandweight. Hint: You only need to specify the name and value columns forpivot_wider().surveys_long %>% group_by(year, measurement, plot_type) %>% summarise(mean_value = mean(value, na.rm = TRUE)) %>% pivot_wider(names_from = measurement, values_from = mean_value)#> `summarise()` has grouped output by 'year', 'measurement'. You can override #> using the `.groups` argument.
Often data is spread out over multiple tables, instead of in one
large, single table. Why collect data in multiple tables? - Not
efficient to include all information in a single table. - Redundant
information makes it more difficult to update or revise data. - Make
changes in one place, not hundreds of places. - Use multiple tables -
Each table contains a single kind of information, for example in our
case we could split the data into three tables: - surveys:
information about individuals - species: information about
species - plots: information about plots - If a species
name changes we only need to change it in the species table - Connect
tables using joins to describe relationships between tables
To illustrate the joining of different tables we will create two
tables: hindfoot_length_join and weight_join,
which contain the average hindfoot_length and
weight per genus, respectively. To simplify the example, we
are only using three genera, two of which are common between the tables.
These are taken using the slice function. For more
information, see ?slice.
# Create a summary table for mean hindfoot_length
hindfoot_length_join <- surveys %>%
group_by(genus) %>%
summarise(hindfoot_length = mean(hindfoot_length, na.rm = TRUE)) %>%
slice(4,8,11)
# Create a summary table for mean weight
weight_join <- surveys %>%
group_by(genus) %>%
summarise(weight = mean(weight, na.rm = TRUE)) %>%
slice(4,8,15)
hindfoot_length_join#> # A tibble: 3 × 2
#> genus hindfoot_length
#> <chr> <dbl>
#> 1 Baiomys 13
#> 2 Chaetodipus 23.9
#> 3 Dipodomys 37.9
weight_join#> # A tibble: 3 × 2
#> genus weight
#> <chr> <dbl>
#> 1 Baiomys 8.6
#> 2 Chaetodipus 24.2
#> 3 Perognathus 8.38
As can be seen above, the Baiomys and Chaetodipus
genera are common between the two tables. Dipodomys only occurs
in hindfoot_length_join and Perognathus only in
weight_join.
Joining of tables is done using the mutating join
functions in dplyr
documentation. All of join functions take three
arguments, two table names (the tables to join, the left-hand side and
right-hand side) and a common key defined in by =. Here we
look at the four most commonly used join functions, which
are illustrated in the image below.
left-join. This joins matching rows from the right-hand
side table to the left-hand side table.right_join. This joins matching rows from the left-hand
side table to the right-hand side table.inner_join. This retains only rows that occur in both
tables.full_join. This retains all values from
all rows.In this case we want to join the data by genus, so we
use by = genus as the common key.

Though the default visualisation of ggplot2 plots is
already at a good standard, there are several ways one can improve even
further the visualisations.
Let us start customising the last plot we have plotted by renaming
the axes and adding a title to the plot. This is done by using the labs
function:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals")
The major item that needs fixing in the plot is the text on the
x-axis as this crammed and is not readable at the moment. This is mainly
due to the fact that the size of the plot is dependent on the size of
the window (in this case RStudio). You can work around this by saving
your plot to a file and specifying the width of the plot (see Saving a plot to a file section).
Themes in the ggplot2 package control the
display of all non-data elements of the plot. Let us start customising
the text on the x-axis by changing its size and position using the
theme function. Note that theme() has several
other arguments and you can read more about them in the theme()
documentation.
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5),
axis.text.y = element_text(size = 7),
strip.text=element_text(size = 7, angle = 45))
With the plot already looking better, the last thing we would like to
change is the legend. Legends are very tricky in ggplot2 as
the fuction to use is determined by the data that is being displayed. In
this case the legend has been created based on colour groupings.
Therefore we can change the lengend title, categories and colour as
follows:
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, colour = sex)) +
geom_line() +
facet_grid(cols = vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of animals") +
theme(axis.text.x = element_text(size = 7, angle = 90, vjust = 0.5),
axis.text.y = element_text(size = 7),
strip.text = element_text(size = 7, angle = 45)) +
scale_colour_brewer("Sex",
palette = "Set1",
breaks = c("F", "M"),
labels = c("Female", "Male"))
Note: If you would like to see what other palettes are available please see http://colourbrewer2.org/#type=qualitative&scheme=Set1&n=3.
ggplot2 has a set of themes that can be used to change
the overall appearance of the graph without much effort. For example, if
we create the first plot again and apply the
theme_bw() theme we get a more simpler white
background:
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point() +
theme_bw()
A list of themes can be found in the ggplot2
documentation.
Challenge
Use what you just learned to create a plot that depicts how the average weight of each species changes through the years.
surveys_complete %>% group_by(year, species) %>% summarise(avg_weight = mean(weight)) %>% ggplot(mapping = aes(x = year, y = avg_weight)) + geom_line() + facet_wrap(facets = vars(species)) + theme_bw() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5))#> `summarise()` has grouped output by 'year'. You can override using the #> `.groups` argument.